The data set analyzed can be obtained from the Kaggle platform: https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023. The YouTuber data has 29 features and 995 observations.
In this modelling, GPT supplied some ideas about feature engineering and coding.
Load libraries
library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggthemes)
library(timeDate)
library(lubridate)
library(reshape2)
library(tidyr)
library(knitr)
library(ROCR)
library(rpart)
library(grid)
library(tidyverse)
library(data.table)
library(vtreat)
library(mlr)
library(caret)
library(randomForest)
library(rpart.plot)
library(VIM)
library(ROCit)
library(xgboost)
library(lime)
library(fastDummies)
library(kableExtra)
library(fpc)
library(factoextra)
library(ape)
library(NbClust)
library(corrplot)
library(cluster)
Loading the data
youtube_data <- read.csv("youtube_UTF_8.csv",check.names = FALSE)
Although the announcement stated that the cleaning process need not be repeated for the already-cleaned YouTuber data, the goals of this project differ from those of Project 1. Following Tutor Scott's suggestion about Project 2 data preparation in Teams, we clean the YouTuber data again, handling the NaN values with a different approach than in Project 1.
colnames(youtube_data)[colnames(youtube_data) == "video views"] <- "video_views"
colnames(youtube_data)[colnames(youtube_data) == "Unemployment rate"] <- "Unemployment_rate"
colnames(youtube_data)[colnames(youtube_data) == "Gross tertiary education enrollment (%)"] <- "Gross_tertiary_education_enrollment"
youtube_data <- youtube_data %>%
mutate_at(vars(subscribers, video_views, uploads, subscribers_for_last_30_days,
video_views_for_the_last_30_days, lowest_monthly_earnings,
highest_monthly_earnings, lowest_yearly_earnings,
highest_yearly_earnings, video_views_rank, country_rank, channel_type_rank, created_year,
`Gross_tertiary_education_enrollment`, Population,
`Unemployment_rate`, Urban_population, Latitude,
Longitude), as.numeric)
# Create 'avg_earnings'
youtube_data <- within(youtube_data, {
subscribers_for_last_30_days <- ifelse(subscribers_for_last_30_days < 0, NA, subscribers_for_last_30_days)
video_views_for_the_last_30_days <- ifelse(video_views_for_the_last_30_days < 0, NA, video_views_for_the_last_30_days)
avg_earnings <- ((highest_yearly_earnings + highest_monthly_earnings * 12) / 2 + (lowest_yearly_earnings + lowest_monthly_earnings * 12) / 2) / 2
}
)
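The averaging above can be checked on toy numbers. `avg_earnings_of` below is a hypothetical helper (not part of the project code) that mirrors the formula: the midpoint of the yearly bounds and the midpoint of the annualized monthly bounds are each taken, then averaged.

```r
# Hypothetical helper mirroring the avg_earnings formula above.
avg_earnings_of <- function(lowest_monthly, highest_monthly,
                            lowest_yearly, highest_yearly) {
  high_side <- (highest_yearly + highest_monthly * 12) / 2
  low_side  <- (lowest_yearly  + lowest_monthly  * 12) / 2
  (high_side + low_side) / 2
}

# A channel earning $1k-$2k per month and $12k-$24k per year:
avg_earnings_of(1000, 2000, 12000, 24000)  # 18000
```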
Handling missing values: we first count the NaN values per column, then remove or impute them based on the characteristics of the data we are going to predict.
# count missing values
# Adapted from lecture slides
count_missing <- function(df) {
sapply(df, FUN = function(col) sum(is.na(col)) )
}
nacounts <- count_missing(youtube_data)
hasNA = which(nacounts > 0)
nacounts[hasNA]
## video_views_rank country_rank
## 1 116
## channel_type_rank video_views_for_the_last_30_days
## 33 56
## subscribers_for_last_30_days created_year
## 337 5
## created_date Gross_tertiary_education_enrollment
## 5 123
## Population Unemployment_rate
## 123 123
## Urban_population Latitude
## 123 123
## Longitude
## 123
The target variable is closely tied to the national GDP information introduced later in Project 2, so country-level features matter for our prediction. Any rows with invalid values or missing country-related information are therefore removed:
filtered_data <- youtube_data %>%
filter(category != "nan",
channel_type != "nan",
Country != "nan",
Latitude !="NaN",
created_year!="NaN",
Gross_tertiary_education_enrollment !="NaN"
)
Upon examining the dataset, we found that [subscribers_for_last_30_days] has a high proportion of missing values, about 32% of the data. Such a substantial amount of missing information can introduce bias, reduce the statistical power of any analysis, and potentially lead to misleading results. Given the extent of the missingness and the additional uncertainty an imputation strategy would introduce, we exclude [subscribers_for_last_30_days] from our modelling.
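The drop decision can be expressed as a small reusable check. `drop_high_na` and the 0.3 threshold below are illustrative choices, not part of the project code.

```r
# Illustrative sketch: drop any column whose share of NAs exceeds a threshold.
drop_high_na <- function(df, threshold = 0.3) {
  na_share <- sapply(df, function(col) mean(is.na(col)))
  df[, na_share <= threshold, drop = FALSE]
}

toy <- data.frame(a = c(1, NA, NA, NA), b = 1:4)  # 'a' is 75% missing
colnames(drop_high_na(toy))  # "b"
```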
[video_views_for_the_last_30_days] is usually regarded as a crucial metric for video platforms, and there are not many missing values for this attribute here. Therefore, we use the KNN method to fill in its missing values. First, we examine the correlation coefficients between this attribute and the other numeric columns.
numeric_df <- filtered_data[sapply(filtered_data, is.numeric)]
cor_matrix <- cor(numeric_df, use="complete.obs")
cor_with_A <- cor_matrix[,"video_views_for_the_last_30_days"]
sorted_cor_with_A <- sort(cor_with_A, decreasing=TRUE)
cor_data <- data.frame(Feature = names(sorted_cor_with_A), Correlation = sorted_cor_with_A)
ggplot(cor_data, aes(x = reorder(Feature, Correlation), y = Correlation)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Correlation with Column video_views_for_the_last_30_days", x = "Features", y = "Correlation Coefficient") +
theme_minimal()
The columns with high correlation coefficients are taken out and normalized along with [video_views_for_the_last_30_days], which is then imputed with kNN() from the VIM package. The earnings columns show the highest correlation coefficients, so we take the representative value "avg_earnings".
sub_df <- filtered_data[, c("video_views_for_the_last_30_days", "avg_earnings","video_views","subscribers")]
# standardize sub_df
sub_df_norm <- as.data.frame(lapply(sub_df, function(x) scale(x, center = TRUE, scale = TRUE)))
# KNN fill in missing value
sub_df_filled <- kNN(sub_df_norm)
mean_A <- mean(sub_df$video_views_for_the_last_30_days, na.rm = TRUE)
sd_A <- sd(sub_df$video_views_for_the_last_30_days, na.rm = TRUE)
# De-standardize the imputed "video_views_for_the_last_30_days" column back to its original scale.
filtered_data$video_views_for_the_last_30_days <- (sub_df_filled$video_views_for_the_last_30_days * sd_A) + mean_A
Project 2 uses annual income as a binary target variable, high or low. Domain knowledge tells us that income has significant regional differences, which is why we removed rows without geographical location data during the cleaning phase. We match each YouTuber's country to its GDP per capita at purchasing power parity (PPP). If a channel's average earnings exceed that figure, it is classified as high earning (1); otherwise it is classified as low earning (0).
# Read the CSV of country GDP per capita from https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
gdp_data <- read.csv("API_NY.GDP.PCAP.PP.CD_DS2_en_csv_v2_5871590.csv", check.names = FALSE, skip = 4)
gdp_data <- gdp_data[, colnames(gdp_data) != ""]  # drop unnamed columns
# Dealing with the problem of inconsistent country names between the two tables
name_mapping <- data.frame(
old_name = c("Russian Federation", "Korea, Rep.", "Egypt, Arab Rep.", "Turkiye"),
new_name = c("Russia", "South Korea", "Egypt", "Turkey")
)
gdp_data <- gdp_data %>%
left_join(name_mapping, by = c("Country Name" = "old_name")) %>%
mutate(`Country Name` = coalesce(new_name, `Country Name`)) %>%
select('Country Name', '2022')
# Merge the data and assign the earning level: 1 (high) or 0 (low).
merged_data <- filtered_data %>%
left_join(gdp_data, by = c("Country" = "Country Name")) %>%
mutate(
earning_levels = as.factor(ifelse(avg_earnings > `2022`, 1, 0))
)
merged_data <- merged_data %>%
rename(GDPperCapita = `2022`) %>%
filter(!is.na(earning_levels))
#Target variable
outcome <- 'earning_levels'
#pos "1" means "high earnings"
pos_label <- "1"
Highly correlated columns and semantically related columns need to be removed. Firstly, "lowest_monthly_earnings", "highest_monthly_earnings", "lowest_yearly_earnings", and "highest_yearly_earnings" directly participate in the calculation of "avg_earnings" and thereby determine the target variable. Using them as predictors would cause problems such as excessive weighting, model overfitting, and data leakage.
Secondly, columns that have a unique value for each row, such as "rank", "Youtuber", "Title", and "channel_type_rank", should be discarded.
Therefore, our candidate columns:
candidate_columns<-c("subscribers",
"video_views",
"category",
"uploads",
"channel_type",
"video_views_for_the_last_30_days",
"created_year",
"created_month",
"created_date",
"Gross_tertiary_education_enrollment",
"Population",
"Unemployment_rate",
"Urban_population",
"Latitude",
"Longitude"
)
[created_month] is categorical but has a periodic meaning, so we convert it from a categorical variable into periodic numerical variables (cos and sin), which preserves the cyclic structure. [created_date] (day of month) is numerical and also periodic.
month_map <- c("Jan" = 1, "Feb" = 2, "Mar" = 3, "Apr" = 4, "May" = 5, "Jun" = 6, "Jul" = 7, "Aug" = 8, "Sep" = 9, "Oct" = 10, "Nov" = 11, "Dec" = 12)
merged_data$created_month<- as.numeric(month_map[merged_data$created_month])
merged_data$cos_month <- cos((merged_data$created_month - 1) * 2 * pi / 12)
merged_data$sin_month <- sin((merged_data$created_month - 1) * 2 * pi / 12)
merged_data$cos_day <- cos((merged_data$created_date - 1) * 2 * pi / 31)  # day of month has period 31
merged_data$sin_day <- sin((merged_data$created_date - 1) * 2 * pi / 31)
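To see why the sin/cos pair helps, compare distances between months under the raw 1-12 coding and the cyclic encoding. `enc` below is a small illustrative helper, not part of the project code.

```r
# December and January are 11 apart in raw coding, but near neighbours
# on the cyclic (cos, sin) circle -- the same distance as any adjacent pair.
enc <- function(m, period = 12) {
  c(cos((m - 1) * 2 * pi / period), sin((m - 1) * 2 * pi / period))
}

dist_raw    <- abs(12 - 1)                      # 11
dist_cyclic <- sqrt(sum((enc(12) - enc(1))^2))  # ~0.518
```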
At the moment we have only two categorical variables, "category" and "channel_type". From the Project 1 EDA we know that the values of these two variables are unordered and unlevelled. Since the main task of Project 2 is modelling, we first check whether any level has too few samples. This is done for:
Stability: classes with few samples can introduce noise, resulting in model instability.
Avoiding overfitting: if a class has few samples, the model may overfit those samples.
Simplifying the model: reducing the number of categories simplifies the model, making it easier to interpret and understand.
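The manual recoding done below can be generalized. `collapse_rare` is a hypothetical helper (not used in the project code) that folds any level rarer than `min_n` into "other":

```r
# Illustrative helper: collapse levels with fewer than min_n observations.
collapse_rare <- function(x, min_n = 10) {
  counts <- table(x)
  ifelse(x %in% names(counts)[counts < min_n], "other", x)
}

x <- c(rep("Music", 20), rep("Gaming", 15), "Trailers", "Shows")
table(collapse_rare(x))  # Gaming 15, Music 20, other 2
```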
#category
table(merged_data$category)
##
## Autos & Vehicles Comedy Education
## 2 66 42
## Entertainment Film & Animation Gaming
## 211 36 79
## Howto & Style Movies Music
## 32 2 179
## News & Politics Nonprofits & Activism People & Blogs
## 26 2 101
## Pets & Animals Science & Technology Shows
## 3 14 12
## Sports Trailers Travel & Events
## 11 2 1
#channel_type
table(merged_data$channel_type)
##
## Animals Autos Comedy Education Entertainment
## 3 2 47 44 273
## Film Games Howto Music News
## 34 80 31 194 29
## Nonprofit People Sports Tech
## 2 57 11 14
Merging rare levels into "other"
merged_data$category[merged_data$category %in% c("Autos & Vehicles", "Movies", "Nonprofits & Activism","Trailers","Travel & Events","Pets & Animals")] <- "other"
merged_data$channel_type[merged_data$channel_type %in% c("Autos", "Animals", "Nonprofit")] <- "other"
So our final candidate_columns:
candidate_columns <- candidate_columns[!candidate_columns %in% c("created_month", "created_date")]
candidate_columns = c(candidate_columns,"cos_month","sin_month","cos_day","sin_day")
data <- merged_data[,c(outcome, candidate_columns)]
round(prop.table(table(data$earning_levels)), 2)
##
## 0 1
## 0.11 0.89
The data is split into a test set and a training set in a ratio of 10:90, and 10% of the training set is then further partitioned off as a calibration set.
set.seed(10161103)
data$rgroup <- runif(dim(data)[1])
dTrainAll <- subset(data, rgroup <= 0.9)
# Test data set
dTest <- subset(data, rgroup > 0.9)
useForCal <- rbinom(n = dim(dTrainAll)[1],
size = 1,
prob = 0.1) > 0
# Calibration data set
dCal <- subset(dTrainAll, useForCal)
# Training data set
dTrain <- subset(dTrainAll,!useForCal)
Splitting the variables into categorical and numerical groups
allVars <- setdiff(colnames(dTrainAll), c(outcome, 'rgroup'))
catVars <- allVars[sapply(dTrainAll[, allVars], class) %in% c('factor', 'character')]
numericVars <- allVars[sapply(dTrainAll[, allVars], class) %in% c('numeric', 'integer')]
Categorical variables
catVars
## [1] "category" "channel_type"
Numerical variables
numericVars
## [1] "subscribers" "video_views"
## [3] "uploads" "video_views_for_the_last_30_days"
## [5] "created_year" "Gross_tertiary_education_enrollment"
## [7] "Population" "Unemployment_rate"
## [9] "Urban_population" "Latitude"
## [11] "Longitude" "cos_month"
## [13] "sin_month" "cos_day"
## [15] "sin_day"
This function employs empirical frequency estimation, with a touch of Laplace smoothing, to compute the predicted probability of the positive class for each value in "appCol". The approach is reminiscent of a Naive Bayes classifier restricted to a single feature. It is used for single categorical variables.
Single Categorical Variable Model
mkPredC<- function(outCol,varCol,appCol,pos=pos_label) {
pPos <- sum(outCol==pos)/length(outCol)
naTab <- table(as.factor(outCol[is.na(varCol)]))
pPosWna <- (naTab/sum(naTab))[pos]
vTab <- table(as.factor(outCol),varCol)
pPosWv <- (vTab[pos,]+1.0e-3*pPos)/(colSums(vTab)+1.0e-3)
pred <- pPosWv[appCol]
pred[is.na(appCol)] <- pPosWna
pred[is.na(pred)] <- pPos
pred
}
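A toy run of the smoothed frequency estimate at the heart of mkPredC, with made-up outcomes and levels:

```r
# Smoothed conditional frequency P(positive | level), shrunk slightly
# toward the base rate, as in mkPredC above.
out <- c(1, 1, 0, 0, 0, 1)           # outcomes
var <- c("a", "a", "a", "b", "b", "b")
pPos <- mean(out == 1)               # base rate = 0.5
vTab <- table(factor(out), var)
pPosWv <- (vTab["1", ] + 1e-3 * pPos) / (colSums(vTab) + 1e-3)
round(pPosWv, 3)                     # a: 0.667, b: 0.333
```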
for(v in catVars) {
pii <- paste('pred',v,sep='')
dTrain[,pii] <- mkPredC(dTrain[,outcome], dTrain[,v], dTrain[,v])
dCal[,pii] <- mkPredC(dTrain[,outcome], dTrain[,v], dCal[,v])
dTest[,pii] <- mkPredC(dTrain[,outcome], dTrain[,v], dTest[,v])
}
Evaluate AUC for the single-variable models
calcAUC <- function(predcol, outcol, pos=pos_label) {
perf <- ROCR::performance(prediction(predcol, outcol==pos),'auc')
as.numeric(perf@y.values)
}
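calcAUC relies on ROCR; as a cross-check, AUC can also be computed in base R via the rank (Mann-Whitney) identity. `auc_rank` below is an illustrative alternative, not part of the project code.

```r
# AUC = P(random positive scores above random negative), via ranks.
auc_rank <- function(score, is_pos) {
  r  <- rank(score)          # average ranks handle ties
  n1 <- sum(is_pos)
  n0 <- sum(!is_pos)
  (sum(r[is_pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_rank(c(0.9, 0.2, 0.8, 0.3), c(TRUE, FALSE, FALSE, TRUE))  # 0.75
```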
params <- c()
auc_values <- c()
set_types <- c()
for(v in catVars) {
pii <- paste('pred',v,sep='')
aucTrain <- calcAUC(dTrain[,pii],dTrain[,outcome])
if(aucTrain >= 0.5){
aucCal <- calcAUC(dCal[,pii],dCal[,outcome])
print(sprintf("%s, trainAUC: %4.3f calibrationAUC: %4.3f", pii, aucTrain, aucCal))
params <- c(params, v, v)
auc_values <- c(auc_values, aucTrain, aucCal)
set_types <- c(set_types, "Training", "Calibration")
}
}
## [1] "predcategory, trainAUC: 0.636 calibrationAUC: 0.545"
## [1] "predchannel_type, trainAUC: 0.638 calibrationAUC: 0.433"
It can be observed that predchannel_type performs poorly on the calibration set, even below the null model, which is the usual baseline with an AUC of 0.5. To ensure a stable model evaluation, we validate the original AUCs with repeated cross-validation.
Repeated cross-validation for AUC (categorical variables). A single data split (such as a train/test split) can lead to sporadic or biased performance evaluations, especially when the data is imbalanced or limited in size. By averaging performance over many distinct random subsets, we obtain a more stable and reliable estimate.
# 100 repeated random splits
vars_cate_fold <- c('category', 'channel_type')
for (var in vars_cate_fold) {
aucs <- rep(0, 100)
for (rep in 1:length(aucs)) {
useForCalRep <- rbinom(n = nrow(dTrainAll),
size = 1,
prob = 0.2) > 0
predRep <- mkPredC(dTrainAll[!useForCalRep, outcome],
dTrainAll[!useForCalRep, var],
dTrainAll[useForCalRep, var])
aucs[rep] <- calcAUC(predRep, dTrainAll[useForCalRep, outcome])
}
print(sprintf("%s: mean: %4.3f; sd: %4.3f", var, mean(aucs), sd(aucs)))
}
## [1] "category: mean: 0.531; sd: 0.070"
## [1] "channel_type: mean: 0.517; sd: 0.068"
The mean AUC of "category" over the 100 repetitions is very close to the calibration AUC computed above. Of the two categorical variables, "category" performs better and more consistently.
The ROC curve (categorical). Let us proceed with ROC curves to further validate these two single categorical models.
# The function plot_roc is designed to plot a ROC curve
plot_roc <- function(predcol, outcol, colour_id = 2, overlaid = F) {
ROCit_obj <- rocit(score = predcol, class = outcol == pos_label) # This is the column or vector containing the predicted scores or probabilities.
par(new = overlaid)
plot(ROCit_obj, col = c(colour_id, 1), legend = FALSE, YIndex = FALSE, values = FALSE)
}
plot_roc(dTest$predcategory, dTest[, outcome],colour_id = 6)
plot_roc(dTest$predchannel_type, dTest[, outcome], colour_id = 4, overlaid = T)
legend('bottomright', legend = c("Category", "Channel Type"), col = c(6, 4), lty = 1)
From the ROC graph, the two single-categorical-variable models lie close to the line of no discrimination (the null model's ROC), indicating that their predictive performance is about as good as random guessing.
Single Numerical Variables Model
This function discretizes a numerical feature into categorical bins (deciles of the training data) and reuses mkPredC to build a single-variable prediction for it.
mkPredN <- function(outCol, varCol, appCol) {
cuts <- unique(as.numeric(
quantile(varCol, probs=seq(0, 1, 0.1), na.rm=T)))
varC <- cut(varCol, cuts)
appC <- cut(appCol, cuts)
mkPredC(outCol, varC, appC)
}
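The discretization step can be seen on a toy column: quantiles of the training values define the cut points, and new values fall into those same bins (values at or below the lowest cut become NA, which mkPredC maps to the base rate).

```r
# mkPredN uses deciles; halves here keep the toy example small.
train_col <- 1:10
cuts <- unique(as.numeric(quantile(train_col, probs = seq(0, 1, 0.5))))
cuts                    # 1.0 5.5 10.0
cut(c(2.5, 7.5), cuts)  # (1,5.5] and (5.5,10]
```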
for(v in numericVars) {
pii <- paste('pred', v, sep = '')
dTrain[, pii] <- mkPredN(dTrain[, outcome], dTrain[, v], dTrain[, v])
dTest[, pii] <- mkPredN(dTrain[, outcome], dTrain[, v], dTest[, v])
dCal[, pii] <- mkPredN(dTrain[, outcome], dTrain[, v], dCal[, v])
aucTrain <- calcAUC(dTrain[, pii], dTrain[, outcome])
if (aucTrain >= 0.5) {
aucCal <- calcAUC(dCal[, pii], dCal[, outcome])
print(sprintf(
"%s: trainAUC: %4.3f; calibrationAUC: %4.3f",
pii,
aucTrain,
aucCal
))
params <- c(params, v, v)
auc_values <- c(auc_values, aucTrain, aucCal)
set_types <- c(set_types, "Training", "Calibration")
}
}
## [1] "predsubscribers: trainAUC: 0.607; calibrationAUC: 0.475"
## [1] "predvideo_views: trainAUC: 0.625; calibrationAUC: 0.525"
## [1] "preduploads: trainAUC: 0.817; calibrationAUC: 0.932"
## [1] "predvideo_views_for_the_last_30_days: trainAUC: 0.931; calibrationAUC: 0.876"
## [1] "predcreated_year: trainAUC: 0.616; calibrationAUC: 0.578"
## [1] "predGross_tertiary_education_enrollment: trainAUC: 0.613; calibrationAUC: 0.510"
## [1] "predPopulation: trainAUC: 0.591; calibrationAUC: 0.596"
## [1] "predUnemployment_rate: trainAUC: 0.590; calibrationAUC: 0.648"
## [1] "predUrban_population: trainAUC: 0.593; calibrationAUC: 0.600"
## [1] "predLatitude: trainAUC: 0.616; calibrationAUC: 0.545"
## [1] "predLongitude: trainAUC: 0.608; calibrationAUC: 0.555"
## [1] "predcos_month: trainAUC: 0.597; calibrationAUC: 0.512"
## [1] "predsin_month: trainAUC: 0.625; calibrationAUC: 0.632"
## [1] "predcos_day: trainAUC: 0.639; calibrationAUC: 0.493"
## [1] "predsin_day: trainAUC: 0.596; calibrationAUC: 0.377"
Four variables were selected for further analysis: "uploads" and "video_views_for_the_last_30_days", because they have high prediction scores but may be overfitting; and "Population" and "Urban_population", whose training and calibration AUCs are very close, both around 0.6.
Repeated cross-validation for AUC (numerical variables)
# 100 repeated random splits
vars_cate_fold <- c('uploads', 'video_views_for_the_last_30_days','Population',"Urban_population")
for (var in vars_cate_fold) {
aucs <- rep(0, 100)
for (rep in 1:length(aucs)) {
useForCalRep <- rbinom(n = nrow(dTrainAll),
size = 1,
prob = 0.1) > 0
predRep <- mkPredC(dTrainAll[!useForCalRep, outcome],
dTrainAll[!useForCalRep, var],
dTrainAll[useForCalRep, var])
aucs[rep] <- calcAUC(predRep, dTrainAll[useForCalRep, outcome])
}
print(sprintf("%s: mean: %4.3f; sd: %4.3f", var, mean(aucs), sd(aucs)))
}
## [1] "uploads: mean: 0.562; sd: 0.138"
## [1] "video_views_for_the_last_30_days: mean: 0.513; sd: 0.101"
## [1] "Population: mean: 0.500; sd: 0.000"
## [1] "Urban_population: mean: 0.500; sd: 0.000"
It can be observed that "Population" and "Urban_population", after 100 repeated splits on the training set, have the same AUC as the null model. The mean AUC for "uploads" and "video_views_for_the_last_30_days" has decreased significantly compared to before, indicating that the earlier suspicion of overfitting was likely correct: these two single-variable models may have learned random noise in the training data and failed to capture the true underlying patterns.
The ROC curve(numerical variables)
Plotting ROC curve for Test set
plot_roc(dTest$preduploads, dTest[, outcome],colour_id = 6)
plot_roc(dTest$predvideo_views_for_the_last_30_days, dTest[, outcome], colour_id = 4, overlaid = T)
plot_roc(dTest$predPopulation, dTest[, outcome], colour_id = 5, overlaid = T)
plot_roc(dTest$predUrban_population, dTest[, outcome], colour_id = 3, overlaid = T)
legend('bottomright', legend = c("Uploads", "Video views (last 30 days)", "Population", "Urban population"), col = c(6, 4, 5, 3), lty = 1)
Plot the AUCs of all single-variable models
allselectedAuc_df<-data.frame(Parameters=params, AUC=auc_values, Set=set_types)
p_single_Auc <- ggplot(allselectedAuc_df, aes(x=Parameters, y=AUC, fill=Set, label=round(AUC, 2))) +
geom_bar(stat="identity", position=position_dodge(0.6), width=0.6) +
theme_minimal() +
labs(title="Training and Calibration AUCs of Single Vars Model", x="Parameters", y="AUC") +
scale_fill_manual(values=c("Training"="lightblue", "Calibration"="orange")) +
geom_text(aes(label=round(AUC, 2)), position=position_dodge(0.3), hjust=-0.2, size=2) +
coord_flip()
print(p_single_Auc)
By comparing the deviance of a model that includes a specific feature with that of a model without it, one can assess the feature's individual contribution. In Project 2, the deviance of each single-variable model is compared to that of the null model to select promising variables.
For a binary target, deviance is defined as:
\[ \text{Deviance} = -2 \times \text{log-likelihood of the model} \]
Deviance reduction is the difference in deviance between the two models:
\[ \text{Deviance Reduction} = \text{Deviance}_{\text{null model}} - \text{Deviance}_{\text{current model}} \]
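A worked numeric example of the two formulas, with a made-up data set and made-up model probabilities:

```r
# 8 observations, 6 positive; the null model predicts the base rate 0.75.
y <- c(1, 1, 1, 1, 1, 1, 0, 0)
p_null <- mean(y)
dev_null <- -2 * sum(ifelse(y == 1, log(p_null), log(1 - p_null)))

# A hypothetical better model: 0.9 for positives, 0.2 for negatives.
p_model <- ifelse(y == 1, 0.9, 0.2)
dev_model <- -2 * sum(ifelse(y == 1, log(p_model), log(1 - p_model)))

dev_null - dev_model  # positive deviance reduction => the model helps
```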
Compute a log-likelihood of the null model
logLikelihood <- function(ypred, ytrue) {
sum(ifelse(ytrue, log(ypred), log(1-ypred)), na.rm=T)
}
#This calculates the log-likelihood of the null model using the base rate and the actual outcomes.
logNull <- logLikelihood(sum(dCal[,outcome]==pos_label)/nrow(dCal), dCal[,outcome]==pos_label)
cat("The log likelihood of the Null model(baseRate) is:", logNull)
## The log likelihood of the Null model(baseRate) is: -29.59267
For each feature, the code computes the deviance reduction by comparing the log-likelihood of a model using that feature against the log-likelihood of the null model. If the deviance reduction exceeds the threshold (minDrop), the feature is added to the list of selected features (com1_selVars).
com1_selVars<- c()
minDrop <- 10
for (v in allVars) {
pii <- paste('pred', v, sep='')
devDrop <- 2*(logLikelihood(dCal[, pii], dCal[,outcome]==pos_label) - logNull)
if (devDrop >= minDrop) {
cat(sprintf("%6s, deviance reduction: %g\n", pii, devDrop))
com1_selVars <- c(com1_selVars, v)
}
}
## preduploads, deviance reduction: 32.737
## predvideo_views_for_the_last_30_days, deviance reduction: 29.0051
The point-biserial correlation coefficient can be considered a filter method for feature selection: it scores each feature before considering its inclusion in the model, and is generally used to assess the relationship between a continuous feature and a binary target variable. Hence, we employ the point-biserial correlation coefficient for feature selection here: \[ r_{pb} = \frac{{M_1 - M_0}}{{s_p}} \sqrt{\frac{{n_0 n_1}}{{n^2}}} \]
Where: - \(M_1\) and \(M_0\) are the means of \(X\) for \(Y = 1\) and \(Y = 0\) respectively. - \(s_p\) is the population standard deviation. - \(n_0\) and \(n_1\) are the number of samples for \(Y = 0\) and \(Y = 1\) respectively. - \(n\) is the total number of samples.
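Because the point-biserial coefficient is algebraically the Pearson correlation with a 0/1 variable, cor.test(..., method = "pearson") computes it directly. A toy check with made-up data:

```r
x <- c(2, 4, 6, 8, 10, 12)
y <- c(0, 0, 0, 1, 1, 1)
r_pearson <- cor(x, y)

# Direct computation from the formula above:
m1 <- mean(x[y == 1]); m0 <- mean(x[y == 0])
s_p <- sd(x) * sqrt((length(x) - 1) / length(x))  # population sd of x
n1 <- sum(y == 1); n0 <- sum(y == 0); n <- length(x)
r_pb <- (m1 - m0) / s_p * sqrt(n0 * n1 / n^2)

all.equal(r_pearson, r_pb)  # TRUE
```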
com2_selVars <- c()
for (v in numericVars) {
#the code computes the Pearson correlation between the feature and the outcome variable.
test <-cor.test(dTrain[,v], ifelse(dTrain[, outcome] == "1", 1, 0), method = "pearson")
if (test$p.value < 0.1) {
#p-value is less than 0.1, indicating the correlation is statistically significant at the 10% level, the feature is added to the list of selected features
com2_selVars <- c(com2_selVars, v)
}
}
com2_selVars
## [1] "video_views" "uploads"
## [3] "Gross_tertiary_education_enrollment" "Population"
## [5] "Urban_population"
Recursive feature elimination (RFE) is a feature selection technique that aims to identify the most important features in a model. It works by repeatedly building models, identifying the worst-performing features, and removing them from the data set until the desired number of features remains.
A random forest model will be used as the estimator here, because random forests can capture non-linear relationships and interactions between features. If certain features matter only through their interactions with other features, linear models might overlook them, but random forests will not.
set.seed(10170805)
#functions=rfFuncs =Random Forest Trees
ctrl <- rfeControl(functions=rfFuncs, method="cv", number=10)
results_rfe<- rfe(dTrain[,allVars], dTrain[,outcome], sizes=c(1:15), rfeControl=ctrl)
com3_selVars<-results_rfe$optVariables
com3_selVars
## [1] "video_views_for_the_last_30_days" "uploads"
## [3] "video_views" "Gross_tertiary_education_enrollment"
## [5] "category"
We will evaluate the model using Accuracy, Precision, Recall, F1 score, and the AUC of the model. Each of these values has different significance in various business scenarios.
For example, if we simply want to determine whether the model can accurately distinguish whether a YouTuber’s income is high or low, Accuracy should be given more attention.
However, if there are specific business requirements that focus on the positive instances (high salary), then it is important to prioritize the Precision and Recall of the model.
#A series of performance metrics were calculated based on the model's predicted results and the actual labels.
performance.Measures <-function(model,data,modelName,outcome,modelType,Dataset,threshold=0.5){
if(modelType == "XGBoost"){
predicted_labels<-ifelse(predict(model, data) > threshold, 1, 0)
}else {
predicted_labels<-predict(model,data,type="class")
}
#Create confusion matrix
conf_matrix <- table(predicted_labels, outcome)
#Compute the confusion matrix
TP <- conf_matrix[2,2]
TN <- conf_matrix[1,1]
FP <- conf_matrix[2,1]
FN <- conf_matrix[1,2]
Accuracy <- (TP + TN) / (TP + TN + FP + FN)
Precision <- TP / (TP + FP)
Recall <- TP / (TP + FN)
F1 <- 2 * (Precision * Recall) / (Precision + Recall)
#AUC
if (modelType == "XGBoost") {
AUC <- calcAUC(predict(model, data), outcome)
} else {
AUC <- calcAUC(predict(model, data)[, 2], outcome)
}
metrics_table <- data.frame(
Model.Name = modelName,
Model.Type = modelType,
Dataset = Dataset,
Accuracy = Accuracy,
Precision = Precision,
Recall = Recall,
F1_Score = F1,
AUC = AUC
)
return(metrics_table)
}
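The confusion-matrix arithmetic inside performance.Measures can be checked on made-up labels:

```r
predicted <- c(1, 1, 1, 0, 0, 0, 1, 0)
actual    <- c(1, 1, 0, 0, 0, 1, 1, 0)
TP <- sum(predicted == 1 & actual == 1)  # 3
TN <- sum(predicted == 0 & actual == 0)  # 3
FP <- sum(predicted == 1 & actual == 0)  # 1
FN <- sum(predicted == 0 & actual == 1)  # 1

Accuracy  <- (TP + TN) / length(actual)                     # 0.75
Precision <- TP / (TP + FP)                                 # 0.75
Recall    <- TP / (TP + FN)                                 # 0.75
F1        <- 2 * Precision * Recall / (Precision + Recall)  # 0.75
```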
#Simultaneously drawing ROC curves for multiple datasets facilitates easy comparison.
plot_multiple_roc <- function(model, data_list, label_list, feature_list, legend_labels, colours) {
for (i in 1:length(data_list)) {
pred_probs <- predict(model, newdata = data_list[[i]][, feature_list[[i]]], type = "prob")[,2]
true_labels <- data_list[[i]][, label_list[[i]]]
plot_roc(pred_probs, true_labels, colour_id = colours[i], overlaid = i != 1)
}
legend('bottomright', legend = legend_labels, col = colours, lty = 1)
}
## just for aesthetic
format_table <- function(data, caption) {
formatted_table <- kable(data, format = "html", caption = caption, table.attr = "class='table table-striped'") %>%
kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "hover")) %>%
column_spec(1, bold = TRUE, color = "white", background = "lightblue") %>%
row_spec(0, bold = TRUE, color = "white", background = "orange", align = "c")
return(formatted_table)
}
Before establishing a classifier model, we observe an extremely imbalanced target, with a ratio of 89:11 between positive and negative samples.
round(prop.table(table(data$earning_levels)), 2)
##
## 0 1
## 0.11 0.89
In an imbalanced data set, the number of samples in the minority class is significantly lower than in the majority class. Without any intervention, the model may become overly sensitive to the majority class and neglect the minority class. So here we define a simple function to adjust the weight ratio of positive and negative examples.
adjust_weights <- function(data, outcome,pos) {
# Calculate weights !=pos mean negative sample
minority_weight <- sum(data[,outcome] != pos) / sum(data[,outcome] == pos)
majority_weight <- 1
# Assign weights based on outcome values
weights <- ifelse(data[,outcome] == pos, minority_weight, majority_weight)
return(weights)
}
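The effect of the weighting can be verified on a toy vector with the same 89:11 imbalance: after reweighting, the total weight of the two classes is equal.

```r
# Mirror of the adjust_weights logic on a toy 89:11 target.
y <- c(rep("1", 89), rep("0", 11))
minority_weight <- sum(y != "1") / sum(y == "1")   # 11/89
w <- ifelse(y == "1", minority_weight, 1)
c(pos = sum(w[y == "1"]), neg = sum(w[y == "0"]))  # both 11
```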
# Train set adjust weight ratio
dTrain$weights<-adjust_weights(dTrain,outcome,pos_label)
# Test and Cal do not need to adjust weights
dTest$weights<-1
dCal$weights<-1
# Formula
fV <- paste(outcome,'~', paste(com1_selVars, collapse=' + '), sep='')
# Build a decision tree model
dtree.comVars1.model <- rpart(fV, data = dTrain, method = "class",
                              weights = dTrain$weights,
                              control = rpart.control(cp = 0.001, minsplit = 30,
                                                      minbucket = 30, maxdepth = 3))
#It generates a tree diagram for the decision tree, displaying the various nodes in the tree,
# the splitting criteria, and the labels of the nodes (for classification problems).
rpart.plot(dtree.comVars1.model)
ROC Curves
data_list = list(dTrain, dCal, dTest)
label_list = list(outcome, outcome, outcome)
feature_list = list(com1_selVars, com1_selVars, com1_selVars)
plot_multiple_roc(dtree.comVars1.model, data_list, label_list, feature_list, c("Training", "Calibration", "Test"), c(4, 5, 6))
Performance Metrics
#function(model,data,modelName,outcome,modelType,Dataset)
dtree.comVars1.metrics <- bind_rows(
performance.Measures(
dtree.comVars1.model,
dTrain[, com1_selVars],
"dtree.comVars1.model",
dTrain[, outcome],
"Decision Tree",
"Training"
),
performance.Measures(
dtree.comVars1.model,
dTest[, com1_selVars],
"dtree.comVars1.model",
dTest[, outcome],
"Decision Tree",
"Test"
)
)
format_table(dtree.comVars1.metrics,"dtree.comVars1.model Performance Metrics")
| Model.Name | Model.Type | Dataset | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|---|---|
| dtree.comVars1.model | Decision Tree | Training | 0.9324117 | 0.9805654 | 0.9438776 | 0.9618718 | 0.901806 |
| dtree.comVars1.model | Decision Tree | Test | 0.9148936 | 0.9615385 | 0.9375000 | 0.9493671 | 0.871875 |
The metrics suggest this model performs very well, with an AUC of 0.87 on the test set. However, the precision is suspiciously high, which raises concern about overfitting, and the ROC curve shows a large deviation on the calibration set.
fV_co2 <-
paste(outcome, '~', paste(com2_selVars, collapse = ' + '), sep = '')
dtree.comVars2.model <-
rpart(
fV_co2,
data = dTrain,
method = "class",
weights = dTrain$weights,
control = rpart.control(
cp = 0.001,
minsplit = 50,
minbucket = 50,
maxdepth = 5
)
)
rpart.plot(dtree.comVars2.model)
ROC Curves
data_list = list(dTrain, dCal, dTest)
label_list = list(outcome, outcome, outcome)
feature_list = list(com2_selVars, com2_selVars, com2_selVars)
plot_multiple_roc(dtree.comVars2.model, data_list, label_list, feature_list, c("Training", "Calibration", "Test"), c(4, 5, 6))
dtree.comVars2.metrics <- bind_rows(
performance.Measures(
dtree.comVars2.model,
dTrain[, com2_selVars],
"dtree.comVars2.model",
dTrain[, outcome],
"Decision Tree",
"Training"
),
performance.Measures(
dtree.comVars2.model,
dTest[, com2_selVars],
"dtree.comVars2.model",
dTest[, outcome],
"Decision Tree",
"Test"
)
)
format_table(dtree.comVars2.metrics,"dtree.comVars2.model Performance Metrics")
| Model.Name | Model.Type | Dataset | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|---|---|
| dtree.comVars2.model | Decision Tree | Training | 0.8371736 | 0.9707031 | 0.8452381 | 0.9036364 | 0.8295270 |
| dtree.comVars2.model | Decision Tree | Test | 0.8085106 | 0.9558824 | 0.8125000 | 0.8783784 | 0.8008929 |
Compared to dtree.comVars1.model, although the various metrics have decreased, they have become more reasonable. The ratio of positive to negative examples in the original data is 89:11. Although we used the relatively crude adjust_weights() function to adjust the weight ratio, the ROC curve on the calibration set shows no sign of data leakage or overfitting.
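The adjust_weights() function itself is not shown here; a minimal sketch of one common approach, inverse-frequency weighting, is given below. The helper make_class_weights() is hypothetical, not the report's actual implementation:

```r
# Hypothetical sketch of inverse-frequency class weighting: minority-class
# rows get proportionally larger weights so that an 89:11 split contributes
# equally to the loss.
make_class_weights <- function(y) {
  tab <- table(y)                                            # class counts
  as.numeric(sum(tab) / (length(tab) * tab[as.character(y)]))# one weight per row
}

y <- c(rep(1, 89), rep(0, 11))
w <- make_class_weights(y)
# Total weight is now balanced across the two classes
sum(w[y == 1]); sum(w[y == 0])   # both equal 50
```

Weights built this way can be passed straight to the `weights` argument of rpart() or to the `weight` slot of an xgb.DMatrix.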
fV_co3<- paste(outcome,'~', paste(com3_selVars, collapse=' + '), sep='')
dtree.comVars3.model <- rpart(fV_co3, data=dTrain, method="class",weights = dTrain$weights,control=rpart.control(cp=0.001, minsplit=50,
minbucket=50, maxdepth=5))
rpart.plot(dtree.comVars3.model)
feature_list_3 = list(com3_selVars, com3_selVars, com3_selVars)
plot_multiple_roc(dtree.comVars3.model, data_list, label_list, feature_list_3, c("Training", "Calibration", "Test"), c(4, 5, 6))
dtree.comVars3.metrics <- bind_rows(
performance.Measures(
dtree.comVars3.model,
dTrain[, com3_selVars],
"dtree.comVars3.model",
dTrain[, outcome],
"Decision Tree",
"Training"
),
performance.Measures(
dtree.comVars3.model,
dTest[, com3_selVars],
"dtree.comVars3.model",
dTest[, outcome],
"Decision Tree",
"Test"
)
)
format_table(dtree.comVars3.metrics,"dtree.comVars3.model Performance Metrics")
| Model.Name | Model.Type | Dataset | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|---|---|
| dtree.comVars3.model | Decision Tree | Training | 0.8878648 | 0.9849341 | 0.8894558 | 0.9347632 | 0.9188937 |
| dtree.comVars3.model | Decision Tree | Test | 0.8191489 | 0.9436620 | 0.8375000 | 0.8874172 | 0.8191964 |
Comparison of ROC curves of three decision tree models on the test set.
plot_roc(predict(dtree.comVars1.model, newdata = dTest[,com1_selVars],type = "prob")[,2], dTest[,outcome],colour_id = 4)
plot_roc(predict(dtree.comVars2.model, newdata = dTest[,com2_selVars],type = "prob")[,2], dTest[,outcome],colour_id = 5,overlaid = T)
plot_roc(predict(dtree.comVars3.model, newdata = dTest[,com3_selVars],type = "prob")[,2], dTest[,outcome],colour_id = 6,overlaid = T)
legend('bottomright', legend = c("dtree.comVars1.model", "dtree.comVars2.model","dtree.comVars3.model"), col = c(4, 5, 6), lty = 1)
XGBoost can be seen as an “enhanced version” of decision trees. It is built on the gradient boosting framework, combining gradient boosting with various optimization techniques: it builds multiple decision trees and combines them into the final prediction, and it adds regularization terms (L1 and L2) to its objective function, which helps prevent overfitting; traditional decision trees typically have no regularization. Essentially, the function xgb.cv() performs cross-validation.
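As a hedged illustration of the regularization terms mentioned above: in the xgboost R package the L2 and L1 penalties on leaf weights are controlled by the `lambda` and `alpha` parameters. The values below are the library defaults, shown for illustration only, not tuned for this data:

```r
# Illustrative parameter list; lambda/alpha are the L2/L1 penalties.
params <- list(
  objective = "binary:logistic",
  lambda    = 1,    # L2 penalty on leaf weights (xgboost default)
  alpha     = 0,    # L1 penalty on leaf weights (xgboost default)
  max_depth = 5,
  eta       = 0.3   # learning rate
)
# A cross-validation call would then look like (dtrain is an xgb.DMatrix):
# cv <- xgb.cv(params = params, data = dtrain, nrounds = 100, nfold = 5,
#              metrics = "auc", early_stopping_rounds = 10)
```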
# This function prepares the data for XGBoost by converting features to matrix format and target to numeric.
prepare.xgboost <- function(features, target, weights) {
# Convert the target variable to a numerical format
numerical_target <- as.numeric(as.character(target))
# Convert the feature set to a matrix format suitable for XGBoost and create a DMatrix object
data_matrix <- xgb.DMatrix(data = as.matrix(features), label = numerical_target,weight = weights)
return(data_matrix)
}
# This function builds an XGBoost model using the provided feature matrix and target vector.
build.XGBoost.model <- function(variable_matrix, labelvec) {
# Perform cross-validation to tune the number of rounds for XGBoost
cv <- xgb.cv(
data = variable_matrix,
params = list(objective = "binary:logistic"),
nfold = 5,
nrounds = 100,
#Stop model training early to prevent overfitting
early_stopping_rounds = 10,
metrics = "auc"
)
# Determine the optimal number of boosting rounds based on cross-validation
NROUNDS <- which.max(cv$evaluation_log$test_auc_mean)
# Train the XGBoost model using the optimal number of boosting rounds
model <- xgboost(
data = variable_matrix,
# The label is already stored in the DMatrix, so this argument is redundant
# (hence the "label will be ignored" warning in the output below)
label = labelvec,
params = list(objective = "binary:logistic"),
nrounds = NROUNDS,
verbose = FALSE
)
return(model)
}
# Features combination(com1_selVars):"uploads","video_views_for_the_last_30_days"
train_dataXgboost_1 <- prepare.xgboost(dTrain[, com1_selVars],dTrain[, outcome],dTrain$weights)
xgb.model.1<-build.XGBoost.model(train_dataXgboost_1,pos_label)
## [1] train-auc:0.919369+0.009398 test-auc:0.881843+0.056688
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
##
## [2] train-auc:0.950992+0.017403 test-auc:0.875675+0.057393
## [3] train-auc:0.965795+0.018057 test-auc:0.890112+0.058850
## [4] train-auc:0.972466+0.010372 test-auc:0.889411+0.060374
## [5] train-auc:0.978354+0.007660 test-auc:0.898108+0.054762
## [6] train-auc:0.981356+0.005181 test-auc:0.902300+0.056105
## [7] train-auc:0.983388+0.004006 test-auc:0.897362+0.056208
## [8] train-auc:0.985666+0.002318 test-auc:0.903554+0.059189
## [9] train-auc:0.988585+0.002566 test-auc:0.903180+0.049139
## [10] train-auc:0.989516+0.001704 test-auc:0.905731+0.046772
## [11] train-auc:0.991233+0.001222 test-auc:0.907953+0.039665
## [12] train-auc:0.992179+0.001191 test-auc:0.910678+0.037050
## [13] train-auc:0.992483+0.000980 test-auc:0.914143+0.038713
## [14] train-auc:0.992952+0.001076 test-auc:0.911869+0.038858
## [15] train-auc:0.993594+0.000914 test-auc:0.909330+0.037865
## [16] train-auc:0.993530+0.000791 test-auc:0.910375+0.041775
## [17] train-auc:0.993636+0.000864 test-auc:0.908845+0.042151
## [18] train-auc:0.993754+0.000725 test-auc:0.908390+0.039343
## [19] train-auc:0.994056+0.000680 test-auc:0.908026+0.042883
## [20] train-auc:0.994293+0.000611 test-auc:0.908239+0.041276
## [21] train-auc:0.994443+0.000711 test-auc:0.909014+0.041276
## [22] train-auc:0.994791+0.000845 test-auc:0.908601+0.040322
## [23] train-auc:0.994932+0.000927 test-auc:0.907913+0.040711
## Stopping. Best iteration:
## [13] train-auc:0.992483+0.000980 test-auc:0.914143+0.038713
## Warning in xgb.get.DMatrix(data, label, missing, weight, nthread =
## merged$nthread): xgboost: label will be ignored.
model1_explainer <- lime::lime(dTest[, com1_selVars], model = xgb.model.1,
bin_continuous = TRUE, n_bins = 2)
model1_explanation <- lime::explain(dTest[1:5, com1_selVars], model1_explainer, n_labels = 1, n_features = 2)
plot_features(model1_explanation)
In xgb.model.1, interpreting the model with the lime() method shows that “video_views_for_the_last_30_days” is the determining feature for both positive and negative instances. Moreover, the condition “video_views_for_the_last_30_days <= 472253500” has a significant negative impact on the prediction; in other words, it decreases the probability of the observation being predicted as 1 (high).
xgb.model.1.metrics <- bind_rows(
performance.Measures(
xgb.model.1,
train_dataXgboost_1,
"xgb.model.1",
dTrain[, outcome],
"XGBoost",
"Training"
),
performance.Measures(
xgb.model.1,
prepare.xgboost(dTest[, com1_selVars],dTest[, outcome],dTest$weights),
"xgb.model.1",
dTest[, outcome],
"XGBoost",
"Test"
)
)
format_table(xgb.model.1.metrics,"xgb.model.1.metrics Performance Metrics")
| Model.Name | Model.Type | Dataset | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|---|---|
| xgb.model.1 | XGBoost | Training | 0.9631336 | 0.9878893 | 0.9710884 | 0.9794168 | 0.9923469 |
| xgb.model.1 | XGBoost | Test | 0.9255319 | 0.9506173 | 0.9625000 | 0.9565217 | 0.8986607 |
train_dataXgboost_2 <- prepare.xgboost(dTrain[, com2_selVars],dTrain[, outcome],dTrain$weights)
xgb.model.2<-build.XGBoost.model(train_dataXgboost_2,pos_label)
## [1] train-auc:0.898335+0.015140 test-auc:0.717103+0.120727
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
##
## [2] train-auc:0.915903+0.021233 test-auc:0.725581+0.112472
## [3] train-auc:0.926860+0.017754 test-auc:0.727386+0.115450
## [4] train-auc:0.938036+0.013097 test-auc:0.748038+0.103587
## [5] train-auc:0.943144+0.010656 test-auc:0.758842+0.096231
## [6] train-auc:0.951266+0.006916 test-auc:0.762376+0.099392
## [7] train-auc:0.957546+0.006718 test-auc:0.759906+0.096580
## [8] train-auc:0.960658+0.005382 test-auc:0.770645+0.094189
## [9] train-auc:0.963526+0.006141 test-auc:0.771385+0.093143
## [10] train-auc:0.966050+0.004929 test-auc:0.772473+0.091343
## [11] train-auc:0.968505+0.004925 test-auc:0.770696+0.092783
## [12] train-auc:0.971492+0.004350 test-auc:0.768336+0.096257
## [13] train-auc:0.973051+0.003398 test-auc:0.768197+0.096755
## [14] train-auc:0.974872+0.003590 test-auc:0.770396+0.095922
## [15] train-auc:0.976231+0.003253 test-auc:0.770273+0.096322
## [16] train-auc:0.977501+0.002516 test-auc:0.770778+0.095346
## [17] train-auc:0.978583+0.002175 test-auc:0.770543+0.095959
## [18] train-auc:0.979848+0.001837 test-auc:0.773730+0.096130
## [19] train-auc:0.981342+0.001391 test-auc:0.772514+0.098937
## [20] train-auc:0.982273+0.001618 test-auc:0.775202+0.099719
## [21] train-auc:0.983595+0.001506 test-auc:0.774092+0.101056
## [22] train-auc:0.984114+0.001445 test-auc:0.774685+0.097970
## [23] train-auc:0.984420+0.001683 test-auc:0.776053+0.098509
## [24] train-auc:0.985031+0.001737 test-auc:0.777765+0.094567
## [25] train-auc:0.985664+0.001809 test-auc:0.776253+0.094881
## [26] train-auc:0.986126+0.002032 test-auc:0.775447+0.095134
## [27] train-auc:0.986829+0.002081 test-auc:0.775730+0.094275
## [28] train-auc:0.987325+0.002097 test-auc:0.776280+0.093053
## [29] train-auc:0.987636+0.002273 test-auc:0.775747+0.094245
## [30] train-auc:0.988124+0.002421 test-auc:0.776920+0.092938
## [31] train-auc:0.988684+0.002528 test-auc:0.776676+0.095388
## [32] train-auc:0.988955+0.002540 test-auc:0.777977+0.095942
## [33] train-auc:0.989148+0.002394 test-auc:0.778076+0.096832
## [34] train-auc:0.989607+0.002259 test-auc:0.777553+0.094799
## [35] train-auc:0.989714+0.002243 test-auc:0.777249+0.095111
## [36] train-auc:0.990129+0.002079 test-auc:0.775952+0.094684
## [37] train-auc:0.990265+0.002138 test-auc:0.776463+0.094806
## [38] train-auc:0.990371+0.002215 test-auc:0.778121+0.094566
## [39] train-auc:0.990498+0.001933 test-auc:0.777649+0.094262
## [40] train-auc:0.990692+0.001867 test-auc:0.776186+0.093956
## [41] train-auc:0.990909+0.001731 test-auc:0.776783+0.093247
## [42] train-auc:0.991095+0.001842 test-auc:0.779005+0.093617
## [43] train-auc:0.991278+0.002016 test-auc:0.777569+0.095013
## [44] train-auc:0.991683+0.001915 test-auc:0.779552+0.093736
## [45] train-auc:0.991825+0.001914 test-auc:0.779052+0.095839
## [46] train-auc:0.991873+0.001968 test-auc:0.777664+0.095300
## [47] train-auc:0.992024+0.001957 test-auc:0.776551+0.095973
## [48] train-auc:0.992177+0.001989 test-auc:0.776490+0.095781
## [49] train-auc:0.992314+0.001982 test-auc:0.778094+0.095845
## [50] train-auc:0.992564+0.001885 test-auc:0.778988+0.095767
## [51] train-auc:0.992718+0.001918 test-auc:0.777451+0.097647
## [52] train-auc:0.992645+0.001850 test-auc:0.778068+0.096515
## [53] train-auc:0.992789+0.001873 test-auc:0.777768+0.095376
## [54] train-auc:0.993034+0.001892 test-auc:0.777151+0.096206
## Stopping. Best iteration:
## [44] train-auc:0.991683+0.001915 test-auc:0.779552+0.093736
## Warning in xgb.get.DMatrix(data, label, missing, weight, nthread =
## merged$nthread): xgboost: label will be ignored.
model2_explainer <- lime::lime(dTest[, com2_selVars], model = xgb.model.2,
bin_continuous = TRUE, n_bins = 2)
model2_explanation <- lime::explain(dTest[1:5, com2_selVars], model2_explainer, n_labels = 1, n_features = 5)
plot_features(model2_explanation)
In the interpretation of xgb.model.2, we observed that the features with the highest weights for the high-income label (label = 1) all contradict the prediction. This is not necessarily an error: in some applications, the presence of certain features can indeed be associated with a lower probability of the positive class.
xgb.model.2.metrics <- bind_rows(
performance.Measures(
xgb.model.2,
train_dataXgboost_2,
"xgb.model.2",
dTrain[, outcome],
"XGBoost",
"Training"
),
performance.Measures(
xgb.model.2,
prepare.xgboost(dTest[, com2_selVars],dTest[, outcome],dTest$weights),
"xgb.model.2",
dTest[, outcome],
"XGBoost",
"Test"
)
)
format_table(xgb.model.2.metrics,"xgb.model.2.metrics Performance Metrics")
| Model.Name | Model.Type | Dataset | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|---|---|
| xgb.model.2 | XGBoost | Training | 0.9462366 | 1.00 | 0.9404762 | 0.9693252 | 0.9914966 |
| xgb.model.2 | XGBoost | Test | 0.9148936 | 0.95 | 0.9500000 | 0.9500000 | 0.8700893 |
One-hot encoding. In this feature set, the categorical variable “category” was selected. If we force the categorical column into a matrix, it becomes an integer variable, and modelling with it would introduce significant bias, because integers are ordered and continuous. We therefore process it with one-hot encoding.
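A toy base-R illustration of why forced integer coding is problematic (model.matrix() is used here as a base-R stand-in for the dummy_cols() call below; the data is made up):

```r
# Toy data: three unordered categories
toy <- data.frame(category = c("Music", "Sports", "Comedy", "Music"))

# Integer coding imposes an accidental order (Comedy < Music < Sports)
codes <- as.integer(factor(toy$category))   # 2 3 1 2

# One-hot coding: one indicator column per level, no false ordering
onehot <- model.matrix(~ category - 1, data = toy)
onehot   # columns categoryComedy, categoryMusic, categorySports
```

Each row of the one-hot matrix has exactly one 1, so no category is treated as “larger” than another.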
table(dTrain[,"category"])
##
## Comedy Education Entertainment
## 50 35 171
## Film & Animation Gaming Howto & Style
## 29 64 24
## Music News & Politics other
## 139 19 8
## People & Blogs Science & Technology Shows
## 84 12 8
## Sports
## 8
# Use the 'dummy_cols' function to encode the 'category' column in the data set
encoded_dTrain <- dummy_cols(dTrain, select_columns = "category")
encoded_dTest <- dummy_cols(dTest, select_columns = "category")
encoded_dCal <- dummy_cols(dCal, select_columns = "category")
setdiff(colnames(encoded_dTrain),colnames(encoded_dTest))
## [1] "category_Science & Technology"
setdiff(colnames(encoded_dTrain),colnames(encoded_dCal))
## [1] "category_other"
# Add a new column 'category_Science & Technology' to the test set and initialize it with 0s
encoded_dTest$"category_Science & Technology" <- 0
# Add a new column 'category_other' to the calibration set and initialize it with 0s
encoded_dCal$category_other <- 0
# Select variable names from 'com3_selVars' excluding the 'category' column
com3_encoded_selVars <- com3_selVars[com3_selVars != "category"]
# Extract names of the new columns created after encoding
encoded_cols <- setdiff(colnames(encoded_dTrain), colnames(dTrain))
# Prepare the data for XGBoost training by selecting relevant columns and setting target and weights
train_dataXgboost_3 <-
prepare.xgboost(encoded_dTrain[, c(com3_encoded_selVars, encoded_cols)], encoded_dTrain[, outcome], dTrain$weights)
# Train the XGBoost model
xgb.model.3 <- build.XGBoost.model(train_dataXgboost_3, pos_label)
## [1] train-auc:0.956533+0.009552 test-auc:0.876663+0.055679
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
##
## [2] train-auc:0.969859+0.006131 test-auc:0.909730+0.052986
## [3] train-auc:0.978918+0.003995 test-auc:0.908127+0.057337
## [4] train-auc:0.982483+0.002260 test-auc:0.913064+0.047014
## [5] train-auc:0.986547+0.001553 test-auc:0.924361+0.048082
## [6] train-auc:0.988962+0.001159 test-auc:0.923176+0.047670
## [7] train-auc:0.991844+0.000972 test-auc:0.919347+0.046705
## [8] train-auc:0.993348+0.001565 test-auc:0.917836+0.043149
## [9] train-auc:0.994473+0.001139 test-auc:0.921786+0.041274
## [10] train-auc:0.995420+0.000942 test-auc:0.921182+0.037263
## [11] train-auc:0.996151+0.000881 test-auc:0.920296+0.033866
## [12] train-auc:0.996468+0.000931 test-auc:0.917513+0.033811
## [13] train-auc:0.996629+0.000966 test-auc:0.914208+0.032251
## [14] train-auc:0.996852+0.000942 test-auc:0.915065+0.031044
## [15] train-auc:0.997071+0.000831 test-auc:0.913111+0.036085
## Stopping. Best iteration:
## [5] train-auc:0.986547+0.001553 test-auc:0.924361+0.048082
# Create a Lime explainer for the trained XGBoost model
model3_explainer <- lime::lime(encoded_dTest[, c(com3_encoded_selVars, encoded_cols)], model = xgb.model.3,
bin_continuous = TRUE, n_bins = 3)
model3_explanation <- lime::explain(encoded_dTest[1:5, c(com3_encoded_selVars, encoded_cols)], model3_explainer, n_labels = 1, n_features = 5)
plot_features(model3_explanation)
xgb.model.3.metrics <- bind_rows(
performance.Measures(
xgb.model.3,
train_dataXgboost_3,
"xgb.model.3",
dTrain[, outcome],
"XGBoost",
"Training"
),
performance.Measures(
xgb.model.3,
prepare.xgboost(encoded_dTest[, c(com3_encoded_selVars, encoded_cols)], encoded_dTest[, outcome],dTest$weights),
"xgb.model.3",
dTest[, outcome],
"XGBoost",
"Test"
)
)
format_table(xgb.model.3.metrics,"xgb.model.3.metrics Performance Metrics")
| Model.Name | Model.Type | Dataset | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|---|---|
| xgb.model.3 | XGBoost | Training | 0.9692780 | 0.9863014 | 0.9795918 | 0.9829352 | 0.9911322 |
| xgb.model.3 | XGBoost | Test | 0.9468085 | 0.9629630 | 0.9750000 | 0.9689441 | 0.8433036 |
dtree.All.Performance<-bind_rows(dtree.comVars1.metrics,dtree.comVars2.metrics,dtree.comVars3.metrics,xgb.model.1.metrics,xgb.model.2.metrics,xgb.model.3.metrics)
format_table(dtree.All.Performance,"All Multivariate Model Performance Metrics")
| Model.Name | Model.Type | Dataset | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|---|---|
| dtree.comVars1.model | Decision Tree | Training | 0.9324117 | 0.9805654 | 0.9438776 | 0.9618718 | 0.9018060 |
| dtree.comVars1.model | Decision Tree | Test | 0.9148936 | 0.9615385 | 0.9375000 | 0.9493671 | 0.8718750 |
| dtree.comVars2.model | Decision Tree | Training | 0.8371736 | 0.9707031 | 0.8452381 | 0.9036364 | 0.8295270 |
| dtree.comVars2.model | Decision Tree | Test | 0.8085106 | 0.9558824 | 0.8125000 | 0.8783784 | 0.8008929 |
| dtree.comVars3.model | Decision Tree | Training | 0.8878648 | 0.9849341 | 0.8894558 | 0.9347632 | 0.9188937 |
| dtree.comVars3.model | Decision Tree | Test | 0.8191489 | 0.9436620 | 0.8375000 | 0.8874172 | 0.8191964 |
| xgb.model.1 | XGBoost | Training | 0.9631336 | 0.9878893 | 0.9710884 | 0.9794168 | 0.9923469 |
| xgb.model.1 | XGBoost | Test | 0.9255319 | 0.9506173 | 0.9625000 | 0.9565217 | 0.8986607 |
| xgb.model.2 | XGBoost | Training | 0.9462366 | 1.0000000 | 0.9404762 | 0.9693252 | 0.9914966 |
| xgb.model.2 | XGBoost | Test | 0.9148936 | 0.9500000 | 0.9500000 | 0.9500000 | 0.8700893 |
| xgb.model.3 | XGBoost | Training | 0.9692780 | 0.9863014 | 0.9795918 | 0.9829352 | 0.9911322 |
| xgb.model.3 | XGBoost | Test | 0.9468085 | 0.9629630 | 0.9750000 | 0.9689441 | 0.8433036 |
On the whole, XGBoost performs slightly better than the decision tree when using the same set of features, as evidenced by several evaluation metrics. This outcome is anticipated, as XGBoost utilizes a gradient boosting framework that can better optimize the objective function. However, some figures are unexpected. For instance, the AUC of xgb.model.3 and xgb.model.1 is remarkably high at 0.99 on the training set, almost 1, yet it drops to roughly 0.84 and 0.90 respectively on the test set. This indicates a high probability of overfitting: the model appears to have “learned too well” from the training set, almost memorizing its details and noise, and has become overly sensitive to patterns that do not exist in the test data. From the perspective of the original data, the features in xgb.model.3 include the categorical variable “category”, which is unordered and non-hierarchical; it also has many levels, some of them very rare, resulting in an imbalanced distribution.
In this section, we will be discussing Clustering. From a business perspective, I am planning to use a single YouTuber as a dimension and perform K-means clustering on their performance on YouTube. Of course, since clustering is unsupervised learning, there is no need to set a target variable. The purpose of clustering is to discover structure or patterns in the data set.
Firstly, considering the variables’ actual meanings, we exclude ranking-type data such as rank, country_rank, and video_views_rank. 1. Individual YouTuber performance: subscribers, video_views, uploads, average earnings, subscribers_for_last_30_days, video_views_for_the_last_30_days, created_year, and related factors.
colnames(filtered_data)
## [1] "rank" "Youtuber"
## [3] "subscribers" "video_views"
## [5] "category" "Title"
## [7] "uploads" "Country"
## [9] "Abbreviation" "channel_type"
## [11] "video_views_rank" "country_rank"
## [13] "channel_type_rank" "video_views_for_the_last_30_days"
## [15] "lowest_monthly_earnings" "highest_monthly_earnings"
## [17] "lowest_yearly_earnings" "highest_yearly_earnings"
## [19] "subscribers_for_last_30_days" "created_year"
## [21] "created_month" "created_date"
## [23] "Gross_tertiary_education_enrollment" "Population"
## [25] "Unemployment_rate" "Urban_population"
## [27] "Latitude" "Longitude"
## [29] "avg_earnings"
# Select the relevant columns
cluster_canidate_columns <-
c(
"subscribers",
"video_views",
"uploads",
"video_views_for_the_last_30_days",
"subscribers_for_last_30_days",
"highest_yearly_earnings",
"avg_earnings",
"lowest_yearly_earnings",
"created_year"
)
# As noted earlier, the missing proportion of [subscribers_for_last_30_days]
# is as high as about 30%, so we drop this column, accepting some information loss.
cluster_canidate_columns <- setdiff(cluster_canidate_columns, "subscribers_for_last_30_days")
Data Standardisation
Use the scale() function to standardise the data in youtuber_km: each feature (column) is transformed to have a mean of 0 and a standard deviation of 1.
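A minimal demonstration on toy numbers (not the YouTube data) of what scale() guarantees:

```r
# Two toy columns on very different scales
m <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)
s <- scale(m)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(s), 10)   # 0 0
apply(s, 2, sd)          # 1 1
```

This matters for k-means, which would otherwise be dominated by large-scale features such as video_views.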
youtuber_km <- filtered_data[,cluster_canidate_columns]
scaled_df<-scale(youtuber_km)
NbClust
The NbClust function is a tool for determining the optimal number of clusters; it evaluates 30 different clustering validity indices on the data set.
# The call that produced `nc` is not shown in the source; a typical invocation (an assumption) is:
nc <- NbClust(scaled_df, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
df_nb <- as.data.frame(table(nc$Best.n[1,]))
# In order to give the maximum bar an orange hue.
df_nb <- df_nb %>%
mutate(max_y = ifelse(Freq == max(Freq), "max", "not_max"))
# Plot
ggplot(df_nb, aes(x=Var1, y=Freq)) +
geom_bar(stat="identity", aes(fill=max_y)) +
scale_fill_manual(values = c("not_max" = "lightblue", "max" = "orange")) +
labs(
x = "Number of Clusters",
y = "Number of Criteria",
title = "Number of Clusters Chosen by 30 Criteria"
) +
theme_minimal() +
theme(legend.position="none")
Perform k-means clustering for k ranging from 1 to 10 and analyze the Calinski-Harabasz index (CH) and the within-cluster sum of squares (WSS) to determine the optimal k. As our distance measure we choose Euclidean distance, which suits continuous numerical data and is intuitive, simple, and symmetric.
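As a quick sanity check, dist() defaults to Euclidean distance, which for two points reduces to sqrt(sum((a - b)^2)) (toy numbers, not the project data):

```r
# A 3-4-5 right triangle: the Euclidean distance between the points is 5
a <- c(0, 3)
b <- c(4, 0)
d1 <- as.numeric(dist(rbind(a, b)))   # 5
d2 <- sqrt(sum((a - b)^2))            # 5
```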
krange <- 1:10
# WSS,Within-Cluster Sum of Squares
wss <- numeric(length(krange))
# Calinski-Harabasz
ch <- numeric(length(krange))
#
KCluster_list <- list()
set.seed(102101)
dist_matrix <- dist(scaled_df)
for (k in krange) {
K_cluster <- kmeans(scaled_df, centers = k, nstart = 25)
KCluster_list[[k]] <- K_cluster
wss[k] <- K_cluster$tot.withinss
if (k > 1) {
stats <- cluster.stats(dist_matrix, K_cluster$cluster)
ch[k] <- stats$ch
} else {
ch[k] <- 0
}
}
df_wss <- data.frame(
k = krange,
Value = wss,
Metric = "WSS"
)
# Data frame for CH values
df_ch <- data.frame(
k = krange,
Value = ch,
Metric = "CH"
)
plot_ch <- ggplot(df_ch, aes(x=k, y=Value)) +
geom_line(color="blue", linewidth=1) +
geom_point(color="blue", size=3) +
labs(title="CH Index for k-means clustering", x="Number of clusters (k)", y="CH Value") +
theme_minimal() +
scale_x_continuous(breaks = krange) # Ensure x-axis has integer breaks
# Plot using ggplot2
plot_wss <- ggplot(df_wss, aes(x=k, y=Value)) +
geom_line(color="red", linewidth=1) +
geom_point(color="red", size=3) +
labs(title="WSS for k-means clustering", x="Number of clusters (k)", y="WSS Value") +
theme_minimal() +
scale_x_continuous(breaks = krange) # Ensure x-axis has integer breaks
grid.arrange(plot_ch, plot_wss, nrow=1)
According to the WSS elbow, the rate of decrease slows significantly at k = 6, while in the CH plot k = 6 is a local peak near the highest peak at k = 8. Combined with the k = 2 suggested by NbClust, we show results for k = 2 and k = 6 respectively.
# k=6
fit_km_6 <- KCluster_list[[6]]
# k=2
fit_km_2<-KCluster_list[[2]]
cat("2 Cluster distribution:", fit_km_2$size, "\n")
## 2 Cluster distribution: 43 781
cat("6 Cluster distribution :", fit_km_6$size, "\n")
## 6 Cluster distribution : 2 7 351 25 78 361
Average data for each cluster (raw data before standardization)
k-value=6
# mean
aggregate(youtuber_km[,-highly_correlated_features], by=list(fit_km_6$cluster), mean)
## Group.1 subscribers video_views uploads video_views_for_the_last_30_days
## 1 1 20850000 1454061765 665.000 6368500000
## 2 2 142500000 119393562411 48558.571 1642953429
## 3 3 20008547 9420629194 4636.946 75292816
## 4 4 22376000 14129278741 193007.280 208390640
## 5 5 43120513 28618986117 14394.128 658790346
## 6 6 19457341 7583340808 3096.427 97104136
## avg_earnings created_year
## 1 0 2011.500
## 2 41850129 2008.143
## 3 1877638 2008.886
## 4 5320993 2009.240
## 5 15891364 2013.090
## 6 2426103 2015.532
We have discovered the following characteristics in the 6 clusters:
Cluster 1: Users have the highest video_views_for_the_last_30_days but zero income. This is not an anomaly on YouTube as some accounts belong to non-profit or official channels that don’t generate any revenue due to company regulations.
Cluster 2: Users with the highest video_views, subscribers and income are most likely celebrities or performers who have demonstrated exceptional performance.
Clusters 3, 4 and 5: performance is average and can be split into three tiers, with Cluster 5 having the highest average income among them.
Cluster 6: this is the most recently created group of YouTubers and has the fewest subscribers, indicating that channel age and subscriber count are factors to consider.
From a business perspective, the six-cluster solution is relatively reasonable.
k-value=2
aggregate(youtuber_km[,-highly_correlated_features], by=list(fit_km_2$cluster), mean)
## Group.1 subscribers video_views uploads video_views_for_the_last_30_days
## 1 1 60781395 46956696890 34090.698 1280423000
## 2 2 20992958 9538111215 9690.963 111725057
## avg_earnings created_year
## 1 25081751 2012.465
## 2 2716852 2012.192
From a data perspective, there are simply two distinct tiers: outstanding and average. Notably, the average channel creation year is essentially the same in both clusters, so channel age does not drive this split.
This displays how data points are grouped into different clusters based on their characteristics and the center location of each cluster.
fviz_cluster(list(data = scaled_df, cluster = fit_km_6$cluster),
stand = FALSE,
ellipse.type = "euclid",
geom = "point",
show.clust.cent = TRUE, # Displays cluster centers
palette = c("#2E9FDF", "#00AFBB","#E7B800", "#FC4E07","#f5ccd8","#8E44AD"),
main = "K-means Cluster plot with Centers",
ggtheme = theme_minimal()
)
## Too few points to calculate an ellipse
fviz_cluster(list(data = scaled_df, cluster = fit_km_2$cluster),
stand = FALSE,
ellipse.type = "euclid",
geom = "point",
show.clust.cent = TRUE, # Displays cluster centers
palette = c("#2E9FDF", "#00AFBB","#E7B800"),
main = "K-means Cluster plot with Centers",
ggtheme = theme_minimal()
)
With k = 2, the graph is consistent with the data description provided earlier: there is a clear boundary between Cluster 1 and Cluster 2, and some extreme values are visible.
Overall, in my opinion, 6-Cluster has higher business value because each of its six clusters has a distinct feature. However, on the other hand, 2-Cluster can yield a good average performance for a YouTuber.
The average silhouette width is an indicator for assessing the consistency of clustering results. Its value range is from -1 to 1, with higher values indicating better clustering quality, meaning that data points are closer within their own clusters and more distant from data points in other clusters.
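A small self-contained example (toy data, not the project data) showing how the average silhouette width is computed with the cluster package:

```r
library(cluster)

# Two well-separated toy blobs should give an average silhouette width
# close to 1 (illustrative data only).
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2))
cl  <- kmeans(x, centers = 2, nstart = 10)
sil <- silhouette(cl$cluster, dist(x))
mean(sil[, "sil_width"])   # close to 1
```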
stats <- cluster.stats(dist(scaled_df)^2, fit_km_6$cluster)
sli1_km <- stats$avg.silwidth # average silhouette width of the k-means result, via the fpc package
stats$avg.silwidth
## [1] 0.4534119
A silhouette width of 0.45 indicates that, on average, the data points in the clusters are closer to each other than to points in the neighboring clusters, but there might still be some overlap or misclassification. It suggests a moderate clustering structure.
stats <- cluster.stats(dist(scaled_df)^2, fit_km_2$cluster)
sli1_km <- stats$avg.silwidth # average silhouette width of the k-means result, via the fpc package
stats$avg.silwidth
## [1] 0.8537964
A silhouette width of 0.85 is a high value, suggesting that the data points are very close to the other points in their cluster and far away from points in the other clusters. This indicates a strong, well-defined clustering structure.
sil_km_6 <- silhouette(fit_km_6$cluster, dist(scaled_df)^2)
sil_km_2 <- silhouette(fit_km_2$cluster, dist(scaled_df)^2)
plot_km_6 <- fviz_silhouette(sil_km_6)
## cluster size ave.sil.width
## 1 1 2 0.97
## 2 2 7 0.18
## 3 3 351 0.44
## 4 4 25 0.58
## 5 5 78 -0.13
## 6 6 361 0.58
plot_km_2 <- fviz_silhouette(sil_km_2)
## cluster size ave.sil.width
## 1 1 43 -0.21
## 2 2 781 0.91
grid.arrange(plot_km_6, plot_km_2, ncol = 2)
Some clusters in the plot_km_6 and plot_km_2 panels have negative average silhouette widths, possibly for the following reasons:
Issues with data distribution: some data points may be isolated in space or far from the cluster they were assigned to, causing them to be mis-assigned (some points in the K-means cluster plot indeed show this).
The features used may not capture the true structure of the data, or irrelevant or noisy features may have interfered with the clustering process.
We used the YouTuber data from the 4009 LMS. Following the data analysis in Project 1, we conducted ML modelling in this project. First, we built classification models to predict whether a YouTuber’s income is high or low, establishing decision tree and XGBoost models and comparing them across three different feature sets. Finally, we performed k-means clustering with individual YouTubers as the unit of observation.